Abstract

In many real-world classification problems, the labels of training ex- 
amples are randomly corrupted. Previous theoretical work on classifi- 
cation with label noise assumes that the two classes are separable, that 
the label noise is independent of the true class label, or that the noise 
proportions for each class are known. In this work we give weaker condi- 
tions that ensure identifiability of the true class-conditional distributions, 
while allowing for the classes to be nonseparable and the noise levels to 
be asymmetric and unknown. Under these conditions, we also establish 
the existence of a consistent discrimination rule, with associated estima- 
tion strategies. The conditions essentially state that most of the observed 
labels are correct, and that the true class-conditional distributions are 
"mutually irreducible," a concept we introduce that limits the similarity 
of the two distributions. For any label noise problem, there is a unique pair 
of true class-conditional distributions satisfying the proposed conditions, 
and we argue that this pair corresponds in a certain sense to maximal de- 
noising of the observed distributions. Both our consistency and maximal 
denoising results are facilitated by a connection to "mixture proportion 
estimation," which is the problem of estimating the maximal proportion 
of one distribution that is present in another. This work is motivated by 
a problem in nuclear particle classification. 
1 Introduction



In binary classification, one observes multiple realizations of two different classes,

    $X_0^1, \ldots, X_0^m \overset{iid}{\sim} P_0$,
    $X_1^1, \ldots, X_1^n \overset{iid}{\sim} P_1$,

where $P_0$ and $P_1$, the class-conditional distributions, are probability
distributions on a measurable space $(\mathcal{X}, \mathfrak{S})$. The feature vector
$X_y^i \in \mathcal{X}$ denotes the i-th realization from class $y \in \{0, 1\}$. The
general goal is to construct a classifier from this data.

There are several kinds of noise that can affect a classification problem. A
first type of noise occurs when $P_0$ and $P_1$ have overlapping support, meaning
that the label is not a deterministic function of the feature vector. In this
situation, even an optimal classifier makes mistakes. In this work, we consider
a second type of noise, label noise, that can occur in addition to the first type
of noise. With label noise, some of the labels of the training examples are
corrupted. We focus in particular on random label noise, as opposed to
feature-dependent or adversarial label noise.

To model label noise, we represent the training data via contamination models:

    $X_0^1, \ldots, X_0^m \overset{iid}{\sim} \tilde{P}_0 := (1 - \pi_0) P_0 + \pi_0 P_1$,    (1)
    $X_1^1, \ldots, X_1^n \overset{iid}{\sim} \tilde{P}_1 := (1 - \pi_1) P_1 + \pi_1 P_0$.    (2)

According to these mixture representations, each "apparent" class-conditional
distribution is in fact a contaminated version of the true class-conditional
distribution, where the contamination comes from the other class. Thus,
$\tilde{P}_0$ governs the training data with apparent class label 0. A proportion
$1 - \pi_0$ of these examples have 0 as their true label, while the remaining
$\pi_0$ have a true label of 1. Similar remarks apply to $\tilde{P}_1$. The noise is
asymmetric in that $\pi_0$ need not equal $\pi_1$. We emphasize that $\pi_0$ and
$\pi_1$ are unknown. The distributions $P_0$ and $P_1$ are also unknown, and we do
not wish to impose models for them. In particular, the supports of $P_0$ and
$P_1$ may overlap, so that the classes are not separable.

Previous work on classification with random label noise, reviewed below, has
not considered the problem in this generality. Our contribution is to introduce
general sufficient conditions on the elements $P_0, P_1, \pi_0, \pi_1$ of the
contamination models for the existence of a consistent discrimination rule;
these conditions are the following:

• (Total noise level) $\pi_0 + \pi_1 < 1$,

• (Mutual irreducibility) It is not possible to write $P_0$ as a nontrivial mixture
of $P_1$ and some other distribution, and vice versa.
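To make the contamination models (1)-(2) concrete, the following minimal sketch simulates them. The Gaussian class-conditional distributions, the noise levels, and the name `sample_contaminated` are illustrative assumptions; the model itself leaves $P_0$ and $P_1$ unspecified.

```python
import random


def sample_contaminated(m, n, pi0, pi1, draw_p0, draw_p1, rng):
    """Draw m points with apparent label 0 and n points with apparent
    label 1 under the contamination models (1)-(2)."""
    if not (pi0 >= 0 and pi1 >= 0 and pi0 + pi1 < 1):
        raise ValueError("need pi0, pi1 >= 0 and pi0 + pi1 < 1")
    # Apparent class 0: with probability pi0 the point really comes from P1.
    z0 = [draw_p1(rng) if rng.random() < pi0 else draw_p0(rng) for _ in range(m)]
    # Apparent class 1: with probability pi1 the point really comes from P0.
    z1 = [draw_p0(rng) if rng.random() < pi1 else draw_p1(rng) for _ in range(n)]
    return z0, z1


# Hypothetical example: P0 = N(0, 1), P1 = N(2, 1), asymmetric noise levels.
rng = random.Random(0)
draw_p0 = lambda r: r.gauss(0.0, 1.0)
draw_p1 = lambda r: r.gauss(2.0, 1.0)
z0, z1 = sample_contaminated(20000, 20000, 0.1, 0.4, draw_p0, draw_p1, rng)

# The sample means should be close to the mixture means
# pi0 * 2 = 0.2 and (1 - pi1) * 2 = 1.2, respectively.
mean0 = sum(z0) / len(z0)
mean1 = sum(z1) / len(z1)
```

Only the contaminated samples `z0` and `z1` are returned; which individual points were corrupted is deliberately hidden, as in the problem setting.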
We present a consistent discrimination rule that leverages consistent estimates
of the noise proportions. These proportions are recovered in turn via mixture 
proportion estimation, which is the problem of estimating the proportion of one 
distribution present in another, given random samples from both distributions. 

To shed some light on these conditions, we remark that in the absence of
any assumption, the solution $(P_0, P_1, \pi_0, \pi_1)$ to (1)-(2), when the
contaminated distributions $\tilde{P}_0, \tilde{P}_1$ are given, is non-unique. In
particular, were the condition on total label noise not required, for any
solution, swapping the role of classes 0 and 1 would also be a solution (with
complementary contamination probabilities), while leaving the apparent labels
unchanged.

Furthermore, we describe in detail (at the population level) the geometry
of the set of all possible solutions $(P_0, P_1, \pi_0, \pi_1)$ to (1)-(2). We argue
that for any pair $\tilde{P}_0 \neq \tilde{P}_1$, there always exists a unique solution
satisfying the above two conditions. Moreover, this solution uniquely
corresponds to the maximum possible total label noise level $(\pi_1 + \pi_0)$
compatible with the observed contaminated distributions, and also to the
maximum possible total variation separation $\|P_1 - P_0\|_{TV}$ under the condition
$\pi_1 + \pi_0 < 1$. In this sense, $P_0$ and $P_1$ satisfying the second condition
are maximally denoised versions of the contaminated distributions. Under these
conditions, we therefore establish universally consistent learning of (i) a
classifier that compensates for everything that could be construed as label
noise, and (ii) the corresponding contamination proportions. In particular, we
emphasize that the proposed conditions do not put any restrictions on the
possible apparent label distributions $\tilde{P}_0, \tilde{P}_1$, so that our
consistency result is distribution-free.

An alternative way to view the contamination model (1)-(2) is to interpret
it as a source separation problem. In the usual source separation setting, the 
realizations from the different sources are linearly mixed, whereas in the present 
model, the source probability distributions are (we do not observe a signal su- 
perposition, but a signal coming from one or the other source). As a common 
point with the source separation setting, it is necessary to postulate additional 
constraints on the sources in order to resolve non-uniqueness of the possible 
solutions. In Independent Component Analysis, for instance, sources are as- 
sumed to be independent. Our assumption of mutual irreducibility between the 
sources plays a conceptually comparable role here. Similarly, the assumption on 
the total noise level resolves the ambiguity that the sources would be otherwise 
only identifiable up to permutation. 

1.1 Problem Statement and Notation 

We consider the problem of designing a discrimination rule, in the presence 
of label noise, that is consistent with respect to a given performance measure. 
To state the problem precisely, we define the following terms. A classifier is a 
measurable function $f : \mathcal{X} \to \{0, 1\}$. A performance measure $R(f)$
assigns to every classifier a nonnegative real number, and depends on the true
distributions, $P_0$ and $P_1$. The optimal performance is denoted
$R^* = \inf_f R(f)$, where the infimum is over all classifiers. A discrimination
rule is a function $\hat{f}_{m,n} : \mathcal{X}^m \times \mathcal{X}^n \to (\mathcal{X} \to \{0, 1\})$
mapping training data to classifiers. A discrimination rule is consistent iff
$R(\hat{f}_{m,n}) \to R^*$ in probability as $\min\{m, n\} \to \infty$.

We focus on the minmax criterion, for which $R(f) = \max\{R_0(f), R_1(f)\}$,
where

    $R_0(f) := P_0(f(X) = 1)$
    $R_1(f) := P_1(f(X) = 0)$

are the Type I and Type II errors. The optimal performance $R^*$ is called the
minmax error. This choice of performance measure is primarily for concreteness;
we expect no difficulty in extending our analysis to other performance measures,
both frequentist and Bayesian, that can be defined in terms of $R_0$ and $R_1$,
such as Neyman-Pearson or expected misclassification cost. This is because our
approach is grounded on a technique to estimate $R_0(f)$ and $R_1(f)$.
We also introduce the contaminated Type I and II errors:

    $\tilde{R}_0(f) := \tilde{P}_0(f(X) = 1)$
                  $= (1 - \pi_0) R_0(f) + \pi_0 (1 - R_1(f))$    (3)

    $\tilde{R}_1(f) := \tilde{P}_1(f(X) = 0)$
                  $= (1 - \pi_1) R_1(f) + \pi_1 (1 - R_0(f))$.    (4)



1.2 Motivating Application 

This work is motivated by a nuclear particle classification problem that is criti- 
cal for nuclear nonproliferation, nuclear safeguards, etc. An organic scintillation 
detector is a device commonly used to detect high-energy neutrons. When a 
particle interacts with the detector, the energy deposited by the particle is con- 
verted to a pulse-shaped voltage waveform, which is then digitally sampled to 
obtain a feature vector $X \in \mathbb{R}^d$, where $d$ is the number of digital samples.
The energy distribution of detected neutrons is characteristic of the nuclear 
source material, and these energy distributions can be inferred from the heights 
of the observed pulses. However, these detectors are also sensitive to gamma 
rays, which are frequently emitted by the same fission events that produce neu- 
trons, and which are also strongly present in background radiation. Therefore, 
to render organic scintillation detectors useful for characterization of nuclear 
materials, it is necessary to classify between neutron and gamma-ray pulses, a
problem referred to as pulse shape discrimination (PSD) (Adams and White 1978;
Ambers et al. 2011).

Unfortunately, even in controlled laboratory settings, it is very difficult to 
obtain pure samples of neutron and gamma-ray pulses. As previously men- 
tioned, the fission events that produce neutrons also yield gamma rays, and 
gamma rays also arrive from background radiation. Although pure gamma-ray 
sources do exist, when collecting measurements from such sources, neutrons 
from the background cannot be completely eliminated. If we view gamma-rays 
as class 0, by taking a strong and pure gamma-ray source, $\pi_0$ will be small but
nonzero. On the other hand, the proportion of gamma-rays emitted during fission
is intrinsic to the source material, and cannot be changed. Thus $\pi_1$ could
be in the neighborhood of one-half. With additional time-of-flight information,
this proportion can be reduced, but is still non-negligible (Ambers et al. 2011).
Thus, PSD is naturally described by the proposed label noise model.



1.3 Related Work 

Classification in the presence of label noise has drawn the attention of numerous
researchers. One common approach is to assume that corrupted labels are more
likely to be associated with outlying data points. This has inspired methods
(Rebbapragada and Brodley 2007), as well as the use of robust (usually
nonconvex) losses (Mason et al. 2000; Xu et al. 2006; Masnadi-Shirazi and
Vasconcelos 2009; Ding and Vishwanathan 2010; Denchev et al. 2012). The above
approaches are not necessarily based on a random label noise model, but rather
assume that noisy labels are more common near the decision boundary.

Generative models have also been applied in the context of random label 
noise. These impose parametric models on the data-generating distributions, 
and include the label noise as part of the model. The parameters are then
estimated using an EM algorithm (Bouveyron and Girard 2009). The method
of Lawrence and Schölkopf (2001) employs kernels in this approach, allowing for
the modeling of more flexible distributions.

Negative results for convex risk minimization in the presence of label noise
have been established by Long and Servedio (2010) and Manwani and Sastry
(2011). These works demonstrate a lack of noise tolerance for boosting and
empirical risk minimization based on convex losses, respectively, and suggest
that any approach based on convex risk minimization will require modification
of the loss, such that the risk minimizer is the optimal classifier with respect
to the uncontaminated distributions. Along these lines, Stempfel and Ralaivola
(2009) recently developed a support vector machine with a modified hinge loss.
Proper modification of the loss, however, requires knowledge of the noise
proportions. Since these proportions are typically not known a priori, our
consistent estimators of these proportions could make approaches based on convex
risk minimization more broadly applicable.

Classification with random label noise has also been studied in the PAC
literature. Most PAC formulations assume that (i) $P_0$ and $P_1$ have
non-overlapping support (i.e., there is a deterministic "target concept" that
provides the true labels), (ii) the label noise is symmetric (i.e., independent
of the true class label), and (iii) the performance measure is the probability
of error (Angluin and Laird 1988; Kearns 1993; Aslam and Decatur 1996;
Cesa-Bianchi et al. 1997; Bshouty et al. 1998; Kalai and Servedio 2003). Under
these conditions, it typically suffices to train on the contaminated data; only
the sample complexity changes. The case of asymmetric label noise was addressed
by Blum and Mitchell (1998) under (i), as the basis of co-training. Some new
directions and a thorough review of this body of work were recently presented in
Jabbari (2010). As we discuss in the next section, new challenges emerge when
(i), (ii), and (iii) are not assumed.

To our knowledge, previous work under the asymmetric noise model has not 
addressed a minimal set of conditions for either consistent classification or for 
consistent estimation of the label noise proportions. 

Classification with label noise is related to several other machine learning 
problems. It is the basis of co-training (Blum and Mitchell 1998). When
$\pi_1 = 0$, we have "one-sided" label noise, and the problem reduces to learning
from positive and unlabeled examples (LPUE), also known as semi-supervised
novelty detection (SSND); see Blanchard et al. (2010) for a review of this
literature. In particular, Blanchard et al. (2010) develop theory for "mixture
proportion estimation" that we leverage in our analysis. A basic version of
multiple instance learning can be reduced to classification with one-sided
label noise (see Sabato and Tishby 2012). Finally, below we establish a
connection between classification with label noise and class probability
estimation.



1.4 Outline 

The remainder of the paper is outlined as follows. Section 2 discusses the
challenges posed by label noise for classifier design. Section 3 presents an
alternate representation of the contamination models that reduces the problem to
that of mixture proportion estimation, which is discussed in Section 4, along
with distributional assumptions and maximal denoising. In Section 5 we introduce
estimates of Type I and Type II error, and show that, under the proposed
conditions, they satisfy a uniform law of large numbers. In Section 6 we focus
on the minmax criterion and present a consistent minmax classifier. Section 7
provides additional discussion of mixture proportion estimation, and Section 8
makes a connection between our work and the problem of class probability
estimation. Proofs of results are found either in the body of the paper, or in
an appendix.



2 The Challenge of Label Noise 

In this section, we address the challenges posed by label noise. We focus on
the population setting ($m, n \to \infty$) and compare classifier design based on
the contaminated distributions, $\tilde{P}_0$ and $\tilde{P}_1$, versus the true
ones, $P_0$ and $P_1$. We introduce the following condition on the total amount of
label noise.

(A) $\pi_0 + \pi_1 < 1$.

This condition states, in a certain sense, that a majority of the labels are correct
on average. It even allows that one of the proportions be very close to one if
the other proportion is small enough. This condition was previously adopted
by Blum and Mitchell (1998).

In this section, we assume that $P_0$ and $P_1$ are absolutely continuous with
respect to Lebesgue measure. Let $p_0$ and $p_1$ denote corresponding densities.
Thus

    $\tilde{p}_0(x) := (1 - \pi_0) p_0(x) + \pi_0 p_1(x)$,
    $\tilde{p}_1(x) := (1 - \pi_1) p_1(x) + \pi_1 p_0(x)$,

are respective densities of $\tilde{P}_0$ and $\tilde{P}_1$.

Proposition 1. Assume (A) holds. For all $\gamma > 0$, and every $x$ such that
$p_0(x) > 0$ and $\tilde{p}_0(x) > 0$,

    $\frac{p_1(x)}{p_0(x)} > \gamma \iff \frac{\tilde{p}_1(x)}{\tilde{p}_0(x)} > \lambda$,

where

    $\lambda = \frac{\pi_1 + \gamma (1 - \pi_1)}{1 - \pi_0 + \gamma \pi_0}$.    (5)

The proof involves a sequence of simple algebraic steps to transform one 
likelihood ratio into another, and the use of (A) to ensure that the direction of 
the inequality is preserved. 
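The threshold correspondence in Proposition 1 can be checked numerically. The sketch below assumes, purely for illustration, Gaussian densities $p_0 = N(0,1)$ and $p_1 = N(2,1)$ and particular values of $\pi_0$, $\pi_1$, and $\gamma$; the function names are hypothetical.

```python
import math


def contaminated_threshold(gamma, pi0, pi1):
    """Threshold lambda of the contaminated LRT, as in equation (5)."""
    return (pi1 + gamma * (1 - pi1)) / (1 - pi0 + gamma * pi0)


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)


pi0, pi1, gamma = 0.1, 0.4, 3.0   # assumed values with pi0 + pi1 < 1
lam = contaminated_threshold(gamma, pi0, pi1)

# Compare the two likelihood ratio tests on a grid of feature values.
agree = True
for k in range(-400, 601):
    x = k / 100
    p0, p1 = normal_pdf(x, 0.0), normal_pdf(x, 2.0)
    tp0 = (1 - pi0) * p0 + pi0 * p1   # contaminated density of apparent class 0
    tp1 = (1 - pi1) * p1 + pi1 * p0   # contaminated density of apparent class 1
    if (p1 / p0 > gamma) != (tp1 / tp0 > lam):
        agree = False
```

On this grid both tests make the same decision at every point, while `lam` differs from `gamma`, which is the source of the difficulty discussed next: the correct contaminated threshold depends on the unknown noise proportions.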

Regardless of the performance measure chosen (probability of error, Neyman- 
Pearson, etc.), the optimal classifier takes the form of a likelihood ratio test 
(LRT) based on the true densities. According to the proposition, every true LRT 
is identical to a contaminated LRT with a different threshold. As the threshold 
of one LRT sweeps over its range, so too does the threshold of the other LRT. 
Equivalently, both LRTs generate the same receiver operating characteristic 
(ROC). 

However, if we design a classifier with respect to the contaminated Type I 
and II errors, we will not obtain a classifier that is optimal with respect to the 
true Type I and II errors, except in very special circumstances. To make this 
point concrete, we now consider three specific performance measures. 

Probability of error. When the feature vector $X$ and label $Y$ are jointly
distributed, the probability of misclassification is minimized by an LRT, where
the threshold $\gamma$ is given by the ratio of a priori class probabilities. If
$\gamma = 1$, then the corresponding threshold for the contaminated LRT is also 1,
regardless of $\pi_0$ and $\pi_1$, which follows directly from (5). Furthermore,
assuming $\pi_0, \pi_1 > 0$ and with some simple algebra it is easy to show that
$\lambda = \gamma$ only if $\gamma = 1$. Thus, if the two classes are not equally
probable a priori, setting the correct $\lambda$ for the contaminated LRT is not
possible, since $\pi_0$ and $\pi_1$ are unknown.

Neyman-Pearson. As noted above, the true and contaminated LRTs have
the same ROC. If a point on this ROC is chosen such that $\tilde{R}_0(f) = \alpha$,
it will generally not be the case that $R_0(f) = \alpha$. This follows because
$\tilde{R}_0(f) = (1 - \pi_0) R_0(f) + \pi_0 (1 - R_1(f))$. Simple algebra shows that
$\tilde{R}_0(f) = R_0(f)$ iff $\pi_0 = 0$ or $R_0(f) + R_1(f) = 1$. The latter
condition is not satisfied by an optimal classifier unless $P_0 = P_1$, since it
corresponds to random guessing. The former case, $\pi_0 = 0$, means the negative
class has no contamination, and is equivalent (after swapping class labels) to
learning from positive and unlabeled examples.

Minmax. The minmax classifier corresponds to the point on the ROC of
the true and contaminated LRTs where $R_0(f) = R_1(f)$. Indeed, if
$R_0(f) \neq R_1(f)$, then $\max\{R_0(f), R_1(f)\}$ can be decreased by moving along
the ROC such that the larger of $R_0(f), R_1(f)$ is decreased. Thus, designing a
classifier with respect to the contaminated distributions yields a point on the
optimal ROC where $\tilde{R}_0(f) = \tilde{R}_1(f)$. Using equations (3) and (4),
simple algebra reveals that $R_0(f) = R_1(f)$ and $\tilde{R}_0(f) = \tilde{R}_1(f)$
for the same $f$ iff $\pi_0 = \pi_1$ or $R_0(f) = R_1(f) = 1/2$. The first
condition is not satisfied for asymmetric label noise, and the latter condition
is not true for an optimal classifier unless $P_0 = P_1$.
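The minmax discussion can be illustrated with a small numerical sketch. It assumes, for illustration only, $P_0 = N(0,1)$, $P_1 = N(2,1)$, and threshold classifiers $f_t(x) = 1\{x > t\}$, which are the likelihood ratio tests for this pair; all names and parameter values are hypothetical.

```python
import math


def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def true_errors(t, mu=2.0):
    """True Type I and Type II errors of f_t(x) = 1{x > t}."""
    return 1.0 - Phi(t), Phi(t - mu)


def contaminated_errors(t, pi0, pi1, mu=2.0):
    """Contaminated errors via equations (3)-(4)."""
    r0, r1 = true_errors(t, mu)
    return (1 - pi0) * r0 + pi0 * (1 - r1), (1 - pi1) * r1 + pi1 * (1 - r0)


def equalizing_threshold(err_fn, lo=-10.0, hi=10.0, iters=200):
    """Bisection for the t at which the two errors returned by err_fn agree.
    Relies on the first error decreasing and the second increasing in t."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        e0, e1 = err_fn(mid)
        if e0 > e1:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


t_true = equalizing_threshold(true_errors)
t_sym = equalizing_threshold(lambda t: contaminated_errors(t, 0.2, 0.2))
t_cont = equalizing_threshold(lambda t: contaminated_errors(t, 0.1, 0.4))
```

With symmetric noise the contaminated criterion selects the same threshold as the true one, whereas with asymmetric noise `t_cont` moves away from `t_true` and the true minmax error of the resulting classifier increases.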

In summary, a classifier that is optimal with respect to the contaminated 
Type I and II errors is not optimal with respect to the true Type I and II 
errors, except in special cases. Based on the above discussion, in the setting 
of asymmetric, random label noise, it is essential to have accurate estimates 
of true Type I and Type II errors. These estimates, in turn, facilitate the 
design of discrimination rules with respect to any criterion. For concreteness, 
in later sections we examine the minmax criterion in detail. However, our 
approach readily extends to other performance measures that are based on the 
false positive and negative rates. 

3 Alternate Mixture Representation 

We introduce an alternative mixture representation that facilitates our subse- 
quent analysis. The following lemma reformulates the problem. 

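Lemma 1. If $P_0 \neq P_1$ and (A) holds, then $\tilde{P}_1 \neq \tilde{P}_0$, and there
exist unique $0 \le \tilde{\pi}_0, \tilde{\pi}_1 < 1$ such that

    $\tilde{P}_0 = (1 - \tilde{\pi}_0) P_0 + \tilde{\pi}_0 \tilde{P}_1$    (6)

    $\tilde{P}_1 = (1 - \tilde{\pi}_1) P_1 + \tilde{\pi}_1 \tilde{P}_0$.    (7)

In particular $\tilde{\pi}_0 = \frac{\pi_0}{1 - \pi_1} < 1$ and
$\tilde{\pi}_1 = \frac{\pi_1}{1 - \pi_0} < 1$.

Proof. To see that $\tilde{P}_1 \neq \tilde{P}_0$, assume by contraposition that
equality holds. Plugging in (1)-(2), we obtain

    $(1 - \pi_1 - \pi_0) P_1 = (1 - \pi_1 - \pi_0) P_0$,

which, since $P_0 \neq P_1$, would imply $\pi_1 + \pi_0 = 1$ and contradict (A).

We turn to identity (6). Matching distributions, the identity holds iff

    $P_1 (\pi_0 - \tilde{\pi}_0 (1 - \pi_1)) = P_0 (1 - \tilde{\pi}_0 + \tilde{\pi}_0 \pi_1 - (1 - \pi_0))$
                                         $= P_0 (\pi_0 - \tilde{\pi}_0 (1 - \pi_1))$.

Since $P_0 \neq P_1$, the unique solution is $\tilde{\pi}_0 = \frac{\pi_0}{1 - \pi_1}$.
From (A) it follows that $\tilde{\pi}_0 < 1$. Similar reasoning applies to the
second identity. □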
This lemma motivates estimates of the true Type I and Type II errors. For
any classifier $f$, we may express the contaminated Type I and Type II errors as

    $\tilde{R}_0(f) = \tilde{P}_0(f(X) = 1)$
                  $= (1 - \tilde{\pi}_0) R_0(f) + \tilde{\pi}_0 (1 - \tilde{R}_1(f))$    (8)

    $\tilde{R}_1(f) = \tilde{P}_1(f(X) = 0)$
                  $= (1 - \tilde{\pi}_1) R_1(f) + \tilde{\pi}_1 (1 - \tilde{R}_0(f))$,    (9)

where Equations (8) and (9) follow from Lemma 1. By solving for $R_0(f)$ and
$R_1(f)$ in (8) and (9), we find

    $R_0(f) = \frac{\tilde{R}_0(f) - \tilde{\pi}_0 (1 - \tilde{R}_1(f))}{1 - \tilde{\pi}_0}
            = 1 - \tilde{R}_1(f) - \frac{1 - \tilde{R}_0(f) - \tilde{R}_1(f)}{1 - \tilde{\pi}_0}$    (10)

    $R_1(f) = \frac{\tilde{R}_1(f) - \tilde{\pi}_1 (1 - \tilde{R}_0(f))}{1 - \tilde{\pi}_1}
            = 1 - \tilde{R}_0(f) - \frac{1 - \tilde{R}_0(f) - \tilde{R}_1(f)}{1 - \tilde{\pi}_1}$.    (11)

We can estimate $\tilde{R}_0(f)$ and $\tilde{R}_1(f)$ from the training data.
Therefore, if we can estimate $\tilde{\pi}_0$ and $\tilde{\pi}_1$, then we can
estimate $R_0(f)$ and $R_1(f)$, and thereby design a classifier. In the next
section we address the estimation of $\tilde{\pi}_0$ and $\tilde{\pi}_1$. Note
that it is not necessary to estimate $\pi_0$ and $\pi_1$, although that would be
possible in light of Lemma 1.
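The inversion in (10)-(11) is simple enough to verify numerically at the population level. In the sketch below, the function name and the numerical values of the errors and noise proportions are assumptions chosen for illustration.

```python
def true_from_contaminated(tr0, tr1, tpi0, tpi1):
    """Recover the true errors (R0, R1) from the contaminated errors
    (tr0, tr1) and the decoupled proportions (tpi0, tpi1),
    following equations (10)-(11)."""
    if not (0 <= tpi0 < 1 and 0 <= tpi1 < 1):
        raise ValueError("decoupled proportions must lie in [0, 1)")
    slack = 1.0 - tr0 - tr1
    r0 = 1.0 - tr1 - slack / (1.0 - tpi0)
    r1 = 1.0 - tr0 - slack / (1.0 - tpi1)
    return r0, r1


# Assumed population quantities for the example.
r0, r1 = 0.2, 0.3          # true Type I and Type II errors
pi0, pi1 = 0.1, 0.4        # noise proportions with pi0 + pi1 < 1

# Forward direction: contaminated errors from equations (3)-(4).
tr0 = (1 - pi0) * r0 + pi0 * (1 - r1)
tr1 = (1 - pi1) * r1 + pi1 * (1 - r0)

# Decoupled proportions from Lemma 1.
tpi0, tpi1 = pi0 / (1 - pi1), pi1 / (1 - pi0)

# Backward direction: equations (10)-(11) return the true errors.
rec0, rec1 = true_from_contaminated(tr0, tr1, tpi0, tpi1)
```

In practice the arguments would be replaced by estimates computed from data, which is the subject of the following sections.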

We conclude this section with a converse to Lemma 1.

Lemma 2. Assume that (6)-(7) hold and $\tilde{P}_1 \neq \tilde{P}_0$. Then
$P_1 \neq P_0$ and there exist unique $\pi_1, \pi_0 \in [0, 1)$ (namely
$\pi_0 = \frac{\tilde{\pi}_0 (1 - \tilde{\pi}_1)}{1 - \tilde{\pi}_1 \tilde{\pi}_0}$ and
$\pi_1 = \frac{\tilde{\pi}_1 (1 - \tilde{\pi}_0)}{1 - \tilde{\pi}_1 \tilde{\pi}_0}$) so that
(1)-(2) hold; furthermore, (A) is satisfied.

Proof. Assume (6)-(7) hold. Since we assume $\tilde{P}_1 \neq \tilde{P}_0$, it holds
that $\tilde{\pi}_1, \tilde{\pi}_0 < 1$. To see that $P_0 \neq P_1$, assume by
contraposition that equality holds. Plugging in (6)-(7) and after
straightforward manipulation, we obtain equivalently

    $\tilde{P}_0 \, \frac{1 - \tilde{\pi}_1 \tilde{\pi}_0}{(1 - \tilde{\pi}_1)(1 - \tilde{\pi}_0)}
      = \tilde{P}_1 \, \frac{1 - \tilde{\pi}_1 \tilde{\pi}_0}{(1 - \tilde{\pi}_1)(1 - \tilde{\pi}_0)}$,

which would contradict the assumption $\tilde{P}_1 \neq \tilde{P}_0$.

Next, in order for identity (1) to hold, by matching distributions in a similar
way as in the proof of Lemma 1, we arrive at the equivalent relation
$(\tilde{\pi}_0 (1 - \pi_1) - \pi_0) P_0 = (\tilde{\pi}_0 (1 - \pi_1) - \pi_0) P_1$. Since
$P_1 \neq P_0$, the unique solution is $\pi_0 = \tilde{\pi}_0 (1 - \pi_1)$. Similarly,
for (2) to hold the unique solution is $\pi_1 = \tilde{\pi}_1 (1 - \pi_0)$. From
these we derive the announced expression for $\pi_0, \pi_1$. It is then easy to
check that
$\pi_0 + \pi_1 - 1 = -\frac{(1 - \tilde{\pi}_1)(1 - \tilde{\pi}_0)}{1 - \tilde{\pi}_1 \tilde{\pi}_0} < 0$,
so that (A) holds. □

Together, Lemmas 1 and 2 imply that for known, distinct uncontaminated
distributions $P_0 \neq P_1$, there is an explicit one-to-one correspondence between
the contamination proportions $(\pi_1, \pi_0)$ of the initial contamination models
(1)-(2) under constraint (A), and the proportions $(\tilde{\pi}_1, \tilde{\pi}_0)$ in
the representation (6)-(7) (with the only constraint
$0 \le \tilde{\pi}_1, \tilde{\pi}_0 < 1$).
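The one-to-one correspondence between the two parameterizations can be written as a pair of mutually inverse maps. The following sketch uses hypothetical function names and example values; it implements only the formulas stated in Lemmas 1 and 2.

```python
def to_decoupled(pi0, pi1):
    """Map (pi0, pi1) from models (1)-(2) to the proportions of (6)-(7),
    as in Lemma 1. Requires condition (A)."""
    if not (pi0 >= 0 and pi1 >= 0 and pi0 + pi1 < 1):
        raise ValueError("need pi0, pi1 >= 0 and pi0 + pi1 < 1")
    return pi0 / (1 - pi1), pi1 / (1 - pi0)


def from_decoupled(tpi0, tpi1):
    """Inverse map, as in Lemma 2."""
    if not (0 <= tpi0 < 1 and 0 <= tpi1 < 1):
        raise ValueError("decoupled proportions must lie in [0, 1)")
    d = 1 - tpi0 * tpi1
    return tpi0 * (1 - tpi1) / d, tpi1 * (1 - tpi0) / d


# Round trip on assumed example values.
tpi0, tpi1 = to_decoupled(0.1, 0.4)
back0, back1 = from_decoupled(tpi0, tpi1)

# Any pair in [0, 1)^2 maps back to proportions satisfying (A).
q0, q1 = from_decoupled(0.9, 0.95)
```

Note that the decoupled proportions may each be close to one while the recovered pair still satisfies $\pi_0 + \pi_1 < 1$, in line with the last statement of Lemma 2.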
The alternate representations (6)-(7) are decoupled in the sense that (6)
does not involve $P_1$, while (7) does not involve $P_0$. This allows us to estimate
$\tilde{\pi}_0$ and $\tilde{\pi}_1$ separately, by reducing to the problem of
"mixture proportion estimation." It further motivates the mutual irreducibility
condition on $(P_0, P_1)$ that, together with (A), ensures that
$\tilde{\pi}_0, \tilde{\pi}_1$ are identifiable. The decoupling perspective also
allows us to address the following question: Given the contaminated
distributions $\tilde{P}_1, \tilde{P}_0$, while $(P_0, P_1)$ are unknown, what are the
solutions $(\pi_0, \pi_1, P_0, P_1)$ satisfying model (1)-(2)? Obviously,
$(0, 0, \tilde{P}_0, \tilde{P}_1)$ is a trivial solution; we will argue that mutual
irreducibility ensures that the solution is unique and non-trivial, and
furthermore that the resulting $P_0, P_1$ correspond to maximally denoised
versions of $\tilde{P}_1, \tilde{P}_0$. The issues are developed in the next
section.



4 Mixture Proportion Estimation and Mutual 
Irreducibility 

Let $F$, $G$, and $H$ be distributions on $(\mathcal{X}, \mathfrak{S})$ such that

    $F = (1 - \nu) G + \nu H$,

where $0 \le \nu \le 1$. Mixture proportion estimation is the following problem:
given iid training samples $Z_F^m \in \mathcal{X}^m$ and $Z_H^n \in \mathcal{X}^n$ of sizes
$m$ and $n$ from $F$ and $H$ respectively, and no information about $G$, estimate
$\nu$. This problem was previously addressed by Blanchard et al. (2010), and here
we relate the necessary definitions and results from that work.

Without additional assumptions, $\nu$ is not an identifiable parameter, as noted
by Blanchard et al. In particular, if $F = (1 - \nu) G + \nu H$ holds, then any
alternate decomposition of the form $F = (1 - \nu + \delta) G' + (\nu - \delta) H$, with
$G' = (1 - \nu + \delta)^{-1} ((1 - \nu) G + \delta H)$, and $\delta \in [0, \nu)$, is also
valid. Because we have no direct knowledge of $G$, we cannot decide which
representation is the correct one. Therefore, to make the problem well-defined,
we will consider estimation of the largest valid $\nu$. The following definition
will be useful.

Definition 1. Let $G$, $H$ be probability distributions. We say that $G$ is
irreducible with respect to $H$ if there exists no decomposition of the form
$G = \gamma H + (1 - \gamma) F'$, where $F'$ is some probability distribution and
$0 < \gamma \le 1$. We say that $G$ and $H$ are mutually irreducible if $G$ is
irreducible with respect to $H$ and vice versa.

The following was established by Blanchard et al.

Proposition 2. Let $F$, $H$ be probability distributions. If $F \neq H$, there is
a unique $\nu^* \in [0, 1)$ and $G$ such that the decomposition
$F = (1 - \nu^*) G + \nu^* H$ holds, and such that $G$ is irreducible with respect
to $H$. If we additionally define $\nu^* = 1$ when $F = H$, then in all cases

    $\nu^* = \max\{\alpha \in [0, 1] : \exists\, G' \text{ probability distribution: } F = (1 - \alpha) G' + \alpha H\}$.
By this result, the following is well-defined.

Definition 2. For any two probability distributions $F$, $H$, define

    $\nu^*(F, H) := \max\{\alpha \in [0, 1] : \exists\, G' \text{ probability distribution: } F = (1 - \alpha) G' + \alpha H\}$.

Clearly, $G$ is irreducible with respect to $H$ if and only if $\nu^*(G, H) = 0$.
Additionally, we show in Section 7 that for any two distributions $F$ and $H$,
$\nu^*(F, H) = \inf_{A \in \mathfrak{S},\, H(A) > 0} F(A) / H(A)$. Similarly, when $F$
and $H$ have densities $f$ and $h$,
$\nu^*(F, H) = \operatorname{ess\,inf}_{x \in \mathrm{supp}(H)} f(x) / h(x)$. These identities
make it possible to check irreducibility in different scenarios. For example,
$\nu^*(G, H) = 0$ whenever the support of $G$ does not contain the support of $H$.
Even if the supports are equal, two distributions can be mutually irreducible,
as in the case of two Gaussians with distinct means and equal variances. See
Section 7 for additional discussion of mutual irreducibility.

To consolidate the above notions, we state the following corollary.

Corollary 1. If $F = (1 - \gamma) G + \gamma H$, and $G$ is irreducible with respect
to $H$, then $\gamma = \nu^*(F, H)$.
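For distributions on a finite set, the infimum of $F(A)/H(A)$ is attained on a single atom, so $\nu^*(F, H)$ reduces to a minimum of probability mass ratios over the support of $H$. The sketch below uses this finite setting, which is an assumption made for illustration, to show both the non-identifiability of $\nu$ and the statement of Corollary 1. The dictionaries and function names are hypothetical.

```python
def mix(a, g, h):
    """Return the mixture (1 - a) * g + a * h of two finite distributions,
    each given as a dict mapping outcomes to probabilities."""
    keys = set(g) | set(h)
    return {x: (1 - a) * g.get(x, 0.0) + a * h.get(x, 0.0) for x in keys}


def nu_star(f, h):
    """nu*(F, H) for finite distributions: the smallest mass ratio
    f(x) / h(x) over the support of H."""
    return min(f.get(x, 0.0) / hx for x, hx in h.items() if hx > 0)


# G puts no mass on "c", so G is irreducible with respect to H.
g = {"a": 0.5, "b": 0.5}
h = {"a": 0.2, "b": 0.3, "c": 0.5}
nu = 0.3
f = mix(nu, g, h)

# Alternate decomposition of the same F with a smaller proportion nu - delta:
# G' = ((1 - nu) G + delta H) / (1 - nu + delta).
delta = 0.1
g_alt = mix(delta / (1 - nu + delta), g, h)
f_alt = mix(nu - delta, g_alt, h)

largest = nu_star(f, h)   # recovers nu because G is irreducible w.r.t. H
```

Here `f_alt` coincides with `f`, so the smaller proportion `nu - delta` is equally valid, but `g_alt` is no longer irreducible with respect to `h`; taking the largest valid proportion singles out the original decomposition.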

Blanchard et al. also studied an estimator $\hat{\nu} = \hat{\nu}(Z_F^m, Z_H^n)$ of
$\nu^*(F, H)$. They show that $\hat{\nu}$ is strongly universally consistent, i.e.,
that for any $F$ and $H$, $\hat{\nu} \to \nu^*(F, H)$ almost surely. The particular
form of the estimator is not important here; only its consistency is relevant
for our purposes. See Section 7 for some intuition for this estimation problem.

Lemma 1 allows us to estimate $\tilde{\pi}_0$ and $\tilde{\pi}_1$ using $\hat{\nu}$.
Recalling the result of Lemma 1, the distributions $\tilde{P}_0$ and $\tilde{P}_1$
can be written

    $\tilde{P}_0 = (1 - \tilde{\pi}_0) P_0 + \tilde{\pi}_0 \tilde{P}_1$
    $\tilde{P}_1 = (1 - \tilde{\pi}_1) P_1 + \tilde{\pi}_1 \tilde{P}_0$.

By Corollary 1 we can estimate $\tilde{\pi}_0$ and $\tilde{\pi}_1$ provided the
following condition holds:

(B) $P_0$ is irreducible with respect to $\tilde{P}_1$ and $P_1$ is irreducible with
respect to $\tilde{P}_0$.

To ensure this condition, we now introduce the following identifiability
assumption:

(C) $P_0$ and $P_1$ are mutually irreducible.

Note that it follows from assumption (C) that $P_0 \neq P_1$. We now establish that
(C) and (B) are essentially equivalent.

Lemma 3. $P_0$ is irreducible with respect to $\tilde{P}_1$ if and only if $P_0$ is
irreducible with respect to $P_1$ and $\pi_1 < 1$. The same statement holds when
exchanging the roles of the two classes. In particular, under assumption (A),
(C) is equivalent to (B).
Proof. This will be a proof by contraposition. Assume first that $P_0$ is not
irreducible with respect to $\tilde{P}_1$. Then there exists a probability
distribution $Q'$ and $0 < \gamma \le 1$ such that

    $P_0 = \gamma \tilde{P}_1 + (1 - \gamma) Q'$.

Now, plugging in Equation (2) for $\tilde{P}_1$ yields

    $P_0 = \gamma ((1 - \pi_1) P_1 + \pi_1 P_0) + (1 - \gamma) Q'$.

Solving for $P_0$ produces

    $P_0 = (1 - \beta) Q' + \beta P_1$,

where $\beta = \frac{\gamma (1 - \pi_1)}{1 - \gamma \pi_1}$. Now, in the case where
$\pi_1 < 1$, then $1 - \gamma \pi_1 > 0$, and $\gamma - \gamma \pi_1 > 0$. Since
$0 < \gamma \le 1$, we deduce $0 < \beta \le 1$, so that $P_0$ is not irreducible with
respect to $P_1$.

Conversely, assume by contradiction that $P_0$ is not irreducible with respect
to $P_1$, i.e., there exists a decomposition $P_0 = \gamma P_1 + (1 - \gamma) Q'$ with
$\gamma > 0$. Then the decomposition $P_0 = \beta \tilde{P}_1 + (1 - \beta) Q'$ holds
with $\beta = \frac{\gamma}{\gamma + (1 - \pi_1)(1 - \gamma)} \in (0, 1]$, so that $P_0$ is
not irreducible with respect to $\tilde{P}_1$. Finally, in the case $\pi_1 = 1$, we
have $\tilde{P}_1 = P_0$, in which case, trivially, $P_0$ is not irreducible with
respect to $\tilde{P}_1$ either. □

To summarize, if (A) and (C) hold, then we can consistently estimate
$\tilde{\pi}_0$ and $\tilde{\pi}_1$, and therefore can also consistently estimate
$R_0(f)$ and $R_1(f)$ via Eqns. (10)-(11). These ideas are developed in the next
section.

To conclude this section, we present a result that rounds out the discussion 
of the initial and modified contamination models, and mutual irreducibility. 
In particular, we describe all possible solutions (ttq, 7Ti, Po, Pi) to our model 
equations 0-([2]) when Pq,Pi are given and arbitrary, and an equivalent char- 
acterization of the unique mutually irreducible solution. It can be seen as an 
analogue of Proposition [2] for the label noise contamination models. 

Theorem 1. Let Pi 7^ p) be two given distinct probability distributions. Denote 
by A the feasible set of quadruples (ttq, tti, Po, Pi) such that (A) and equations 
are satisfied. 

1. There is a unique quadruple (tTq, 7rJ , P *, P* ) G A so that (C) holds. 

2. Denoting ttq := ^*(P ,Pi) < 1 and n* := v*(Pi,P Q ) < 1, it holds 

_ + 7rS(l - TfJ) . 7f;(i-ffS) 



1~*~# I 1 1 

-71x71-0 I-TTiTTq 



(12) 



3. The feasible region R for the proportions (7ro,7Ti) (that is, the projection 
of A to its first two coordinates, which is also one-to-one), is the closed 
quadrilateral defined by the intersection of the positive quadrant 0/K 2 with 
the half-planes given by 

TTO + 7ri7To < 7To, 7Tl + 71-07^ < 7T* . (13) 
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4- The mutually irreducible solution (ttq, 7rJ, Pq , P*) is also equivalently char- 
acterized as: 

• the unique maximizer of (ttq + 7Tx) over A; 



• the unique extremal point of A where both of the constraints in ( 13 1 
are active; 

• the unique maximizer over A of the total variation distance \\Pq — -Pi||jiy 

The proof of the theorem relies on the explicit one-to-one correspondence
established in Lemmas 1 and 2 between the solutions of the original
decomposition (1)-(2) and its decoupled reformulation (6)-(7). The result of
Proposition 2 is applied to the decoupled formulation, then pulled back, via
the correspondence, to the original representation. The last statement
concerning the total variation norm is based on the relation

$$(P_1 - P_0) = (1 - \pi_0 - \pi_1)^{-1}(\tilde P_1 - \tilde P_0),$$

obtained by subtracting (1) from (2). Therefore, the maximum feasible value of
$\|P_1 - P_0\|_{TV}$ corresponds to the maximum of $(\pi_0 + \pi_1)$, i.e. the
unique mutually irreducible solution.

The geometrical interpretation of this theorem is visualized in Figure 1.
In particular, point 1 of the theorem shows that conditions (A) and (C) do
not restrict the class of possible observable contaminated distributions
$(\tilde P_1, \tilde P_0)$; rather, they ensure in all cases the
identifiability of the mixture model. Point 4 indicates that the unique
solution satisfying the mutual irreducibility condition (C) can be
characterized as maximizing the possible total label noise level
$(\pi_0 + \pi_1)$, or, still equivalently, the total variation separation of
the source probabilities $P_0, P_1$. In this sense, the mutually irreducible
solution can also be interpreted as maximal label denoising or maximal source
separation of the observed contaminated distributions.
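As a small numerical illustration of points 2 and 3 of Theorem 1, the following sketch maps given decoupled proportions to the mutually irreducible pair via (12) and checks the half-plane constraints (13). The function names and the numerical tolerance are hypothetical choices made for this illustration; the decoupled proportions are assumed to be known rather than estimated.

```python
def irreducible_proportions(t0, t1):
    """Map (t0, t1) = (tilde pi_0^*, tilde pi_1^*) to (pi_0^*, pi_1^*) via (12)."""
    if not (0.0 <= t0 < 1.0 and 0.0 <= t1 < 1.0):
        raise ValueError("decoupled proportions must lie in [0, 1)")
    denom = 1.0 - t0 * t1
    return t0 * (1.0 - t1) / denom, t1 * (1.0 - t0) / denom


def is_feasible(p0, p1, t0, t1, tol=1e-12):
    """Check whether (p0, p1) lies in the region described by (13)."""
    return (p0 >= -tol and p1 >= -tol
            and p0 + t0 * p1 <= t0 + tol
            and p1 + t1 * p0 <= t1 + tol)


def best_sum_on_grid(t0, t1, steps=100):
    """Largest value of p0 + p1 over feasible points of a regular grid on [0, 1]^2."""
    best = 0.0
    for i in range(steps + 1):
        for j in range(steps + 1):
            p0, p1 = i / steps, j / steps
            if is_feasible(p0, p1, t0, t1):
                best = max(best, p0 + p1)
    return best


if __name__ == "__main__":
    t0, t1 = 0.4, 0.25
    p0, p1 = irreducible_proportions(t0, t1)
    print("mutually irreducible proportions:", p0, p1)
    print("both constraints in (13) active:",
          abs(p0 + t0 * p1 - t0) < 1e-12, abs(p1 + t1 * p0 - t1) < 1e-12)
    print("grid maximum of p0 + p1:", best_sum_on_grid(t0, t1), "vs", p0 + p1)
```

On this example the grid maximum of the total noise level does not exceed the value attained at the point given by (12), in line with point 4 of the theorem.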



5 Estimating Type I and Type II Errors 

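We denote the training data by $Z_0^m = (X_0^1, \dots, X_0^m) \in \mathcal{X}^m$
and $Z_1^n = (X_1^1, \dots, X_1^n) \in \mathcal{X}^n$. Given a classifier $f$,
and iid samples $Z_0^m$ and $Z_1^n$, we define the following estimates of the
contaminated Type I and Type II errors:

$$\widehat{\tilde R}_0(f, Z_0^m) := \frac{1}{m}\sum_{i=1}^{m} 1_{\{f(X_0^i) = 1\}},
\qquad
\widehat{\tilde R}_1(f, Z_1^n) := \frac{1}{n}\sum_{i=1}^{n} 1_{\{f(X_1^i) = 0\}}.$$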

Following the theory developed in Section 3, define the estimates of
$\tilde\pi_0$ and $\tilde\pi_1$ as

$$\widehat{\tilde\pi}_0(Z_0^m, Z_1^n) := \hat\nu(Z_0^m, Z_1^n),
\qquad
\widehat{\tilde\pi}_1(Z_0^m, Z_1^n) := \hat\nu(Z_1^n, Z_0^m),$$

where $\hat\nu$ is the estimator of Blanchard et al. (2010).

Figure 1: Geometry of the feasible region $\Lambda$ for proportions
$(\pi_0, \pi_1)$ solutions of the contamination model (1)-(2), when
contaminated distributions $(\tilde P_0, \tilde P_1)$ are observed and the true
distributions $(P_0, P_1)$ are unknown. Each feasible $(\pi_0, \pi_1)$
corresponds to a single associated solution $(P_0, P_1)$. The extremal point
$(\pi_0^*, \pi_1^*)$ is the unique point corresponding to a mutually
irreducible solution $(P_0^*, P_1^*)$. The dashed line indicates the maximal
level line $(\pi_0 + \pi_1) = c$ intersecting with $\Lambda$.



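Plugging these estimates into Equations (10) and (11), we define the following
estimates for the Type I and Type II errors:

$$\hat R_0(f, Z_0^m, Z_1^n) := 1 - \widehat{\tilde R}_1(f, Z_1^n)
- \frac{1 - \widehat{\tilde R}_0(f, Z_0^m) - \widehat{\tilde R}_1(f, Z_1^n)}
{1 - \widehat{\tilde\pi}_0(Z_0^m, Z_1^n)}, \tag{14}$$

$$\hat R_1(f, Z_0^m, Z_1^n) := 1 - \widehat{\tilde R}_0(f, Z_0^m)
- \frac{1 - \widehat{\tilde R}_1(f, Z_1^n) - \widehat{\tilde R}_0(f, Z_0^m)}
{1 - \widehat{\tilde\pi}_1(Z_0^m, Z_1^n)}.$$

For brevity, we will sometimes write $\hat R_i(f)$. The following theorem
shows that the estimators $\hat R_i(f)$ converge uniformly in probability to
$R_i(f)$.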

Theorem 2. Let $\{\mathcal{F}_k\}_{k=1}^{\infty}$ denote a family of sets of
classifiers, with $\mathcal{F}_k$ having finite VC-dimension $V_k$. Let
$k(m,n)$ take values in $\mathbb{N}$ such that

$$\frac{V_{k(m,n)} \log(\min(m,n))}{\min(m,n)} \to 0$$

as $\min(m,n) \to \infty$. If assumptions (A) and (C) hold, then, as
$\min(m,n) \to \infty$,

$$\sup_{f \in \mathcal{F}_{k(m,n)}} |\hat R_i(f, Z_0^m, Z_1^n) - R_i(f)| \to 0$$

in probability for $i = 0, 1$.

The proof consists of showing that $\widehat{\tilde R}_0(f, Z_0^m)$ and
$\widehat{\tilde R}_1(f, Z_1^n)$ converge uniformly to $\tilde R_0(f)$ and
$\tilde R_1(f)$ (by the VC inequality), that
$\widehat{\tilde\pi}_i \to \tilde\pi_i$ in probability, $i = 0, 1$ (by the
result of Blanchard et al.), and a continuity argument.

In the next section, we use the estimators $\hat R_0$ and $\hat R_1$ to
develop a consistent minmax classifier. A similar development should be
possible for other criteria that depend on Type I and II errors.
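The plug-in construction (14) can be sketched in a few lines. This is only an illustration under stated assumptions: the estimates of the decoupled proportions are taken as inputs (the estimator of Blanchard et al. is not implemented here), classifier outputs are encoded as 0/1, and all function and argument names are hypothetical.

```python
def contaminated_errors(preds0, preds1):
    """Empirical contaminated Type I and Type II errors.

    preds0: 0/1 classifier outputs on the sample carrying (noisy) label 0.
    preds1: 0/1 classifier outputs on the sample carrying (noisy) label 1.
    """
    r0_tilde = sum(1 for p in preds0 if p == 1) / len(preds0)
    r1_tilde = sum(1 for p in preds1 if p == 0) / len(preds1)
    return r0_tilde, r1_tilde


def denoised_errors(r0_tilde, r1_tilde, pi0_tilde, pi1_tilde):
    """Plug-in estimates of the true Type I / Type II errors, following (14).

    pi0_tilde, pi1_tilde: estimates of the decoupled proportions, assumed < 1.
    """
    if pi0_tilde >= 1.0 or pi1_tilde >= 1.0:
        raise ValueError("proportion estimates must be strictly below 1")
    gap = 1.0 - r0_tilde - r1_tilde
    r0 = 1.0 - r1_tilde - gap / (1.0 - pi0_tilde)
    r1 = 1.0 - r0_tilde - gap / (1.0 - pi1_tilde)
    return r0, r1


if __name__ == "__main__":
    # Population-level sanity check: true errors 0.1 and 0.2, noise levels
    # pi_0 = 0.2 and pi_1 = 0.1, so the decoupled proportions are
    # 0.2 / 0.9 and 0.1 / 0.8 and the contaminated errors are 0.24 and 0.27.
    print(denoised_errors(0.24, 0.27, 0.2 / 0.9, 0.1 / 0.8))
```

With exact contaminated errors and exact proportions, the formulas return the true errors, which is the population counterpart of the consistency statement in Theorem 2.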

6 Minmax Consistency 

Define the max error of a classifier $f$ as

$$R(f) := \max\{R_0(f), R_1(f)\}. \tag{15}$$

Let $\mathcal{F}$ denote an arbitrary set of classifiers. We define the minmax
error over $\mathcal{F}$ as

$$R(\mathcal{F}) := \inf_{f \in \mathcal{F}} R(f).$$

Let $\mathcal{F}_0$ denote the set of all classifiers. We will denote the
minmax error over $\mathcal{F}_0$ as

$$R^* := \inf_{f \in \mathcal{F}_0} R(f) = R(\mathcal{F}_0).$$

Define the estimates of $R(f)$ and $R(\mathcal{F})$ as

$$\hat R(f) := \max\{\hat R_0(f), \hat R_1(f)\}, \qquad
\hat R(\mathcal{F}) := \inf_{f \in \mathcal{F}} \hat R(f).$$

Now let $\tau_k$ denote a sequence of positive numbers such that
$\tau_k \to 0$ as $k \to \infty$. Define $\hat f_k$ to be any classifier

$$\hat f_k \in \{f \in \mathcal{F}_k : \hat R(f) \le \hat R(\mathcal{F}_k) + \tau_k\}. \tag{16}$$

This construction allows us to avoid assuming the existence of an empirical
error minimizer.
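For intuition, the selection rule (16) can be sketched for a finite list of candidate classifiers. Restricting to a finite list is an assumption made only for this illustration (the classes in the text are VC classes), and the names below are hypothetical.

```python
def near_minmax(candidates, est_r0, est_r1, tau):
    """Return every candidate whose estimated max error is within tau of the
    smallest estimated max error over the (finite) candidate list, cf. (16).

    est_r0, est_r1: callables returning the estimated Type I / Type II error.
    """
    scores = [max(est_r0(f), est_r1(f)) for f in candidates]
    best = min(scores)
    return [f for f, s in zip(candidates, scores) if s <= best + tau]


if __name__ == "__main__":
    # Hypothetical estimated (Type I, Type II) errors for three classifiers.
    errs = {"a": (0.10, 0.30), "b": (0.20, 0.22), "c": (0.35, 0.05)}
    pick = near_minmax(list(errs), lambda f: errs[f][0], lambda f: errs[f][1], 0.05)
    print(pick)
```

Any element of the returned list is an admissible choice in the sense of (16); as the tolerance shrinks, the list shrinks toward the empirical minimizers.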

Let $\{\mathcal{F}_k\}_{k=1}^{\infty}$ denote a family of sets of classifiers.
The following universal approximation property is known to be satisfied for
various families of VC classes, such as histograms, decision trees, neural
networks, and polynomial classifiers.

(D) For all distributions $Q$ and measurable functions
$f^* : \mathcal{X} \to \{0, 1\}$,

$$\lim_{k \to \infty} \inf_{f \in \mathcal{F}_k} Q(f(X) \neq f^*(X)) = 0.$$

Theorem 2 gives us control over the estimation error. Condition (D) provides
control of the approximation error.

Lemma 4. Let $\{\mathcal{F}_k\}_{k=1}^{\infty}$ denote a sequence of classifier
sets. If assumption (D) holds, then

$$\lim_{k \to \infty} \inf_{f \in \mathcal{F}_k} R(f) = R^*.$$

We can now state the consistency result. This result is comparable in form
to a classical consistency result in the standard classification setup; see
Theorem 18.1 of Devroye et al. (1996), where a condition similar to (D), or
more precisely to Lemma 4, is discussed.

Theorem 3. Let $\{\mathcal{F}_k\}_{k=1}^{\infty}$ denote a family of sets of
classifiers, with $\mathcal{F}_k$ having finite VC-dimension $V_k$. Let
$k(m,n)$ take values in $\mathbb{N}$ such that $k(m,n) \to \infty$ as
$\min(m,n) \to \infty$. If

$$\frac{V_{k(m,n)} \log(\min(m,n))}{\min(m,n)} \to 0$$

as $\min(m,n) \to \infty$ and assumptions (A), (C), and (D) hold, then
$R(\hat f_{k(m,n)}) \to R^*$ in probability as $\min(m,n) \to \infty$.

If conditions (A) or (C) fail to hold, our discrimination rule is still
consistent with respect to the maximally denoised versions of $\tilde P_0$ and
$\tilde P_1$, which always exist and are unique by Theorem 1. In this sense,
our analysis is distribution free and the consistency is universal.






The proof of Theorem 3 proceeds by a decomposition into estimation and
approximation errors (denoting $k = k(m,n)$ for brevity),

$$R(\hat f_k) - R^* = R(\hat f_k) - R(\mathcal{F}_k) + R(\mathcal{F}_k) - R^*.$$

The approximation error goes to zero by Lemma 4. The estimation error is
bounded as follows. For the sake of argument, assume $R(\mathcal{F}_k)$ is
realized by $f_k^* \in \mathcal{F}_k$. Then

$$R(\hat f_k) - R(\mathcal{F}_k) = R(\hat f_k) - R(f_k^*)
\le \hat R(\hat f_k) - \hat R(f_k^*) + \epsilon \le 2\epsilon,$$

where the first inequality holds for any $\epsilon > 0$, with probability
going to one, by Theorem 2. The second inequality holds by definition of
$\hat f_k$, for $k$ sufficiently large. See the appendix for details.

7 Additional Perspectives on Mixture Proportion Estimation

In this section we provide some simple results that characterize
$\nu^*(F, H)$. The proof of the following result is embedded in the proof of
Proposition 5 of Blanchard et al. (2010) (recalled as Proposition 2 of the
current paper), but we reproduce it here for convenience.

Lemma 5. For any distributions $F, H$ on a measurable space
$(\mathcal{X}, \mathfrak{S})$,

$$\nu^*(F, H) = \inf_{A \in \mathfrak{S},\, H(A) > 0} \frac{F(A)}{H(A)}.$$

If $F$ and $H$ are absolutely continuous with respect to Lebesgue measure,
with densities $f$ and $h$, then

$$\nu^*(F, H) = \operatorname*{ess\,inf}_{x \in \mathrm{supp}(H)} \frac{f(x)}{h(x)}. \tag{17}$$

Proof. We will prove the result for continuous distributions; the general case
is entirely analogous. Let

$$\gamma^* = \operatorname*{ess\,inf}_{x \in \mathrm{supp}(H)} \frac{f(x)}{h(x)}.$$

We need to show (i) there exists $g$ such that
$f = (1 - \gamma^*) g + \gamma^* h$, and (ii) if $\gamma > \gamma^*$, then no
such $g$ exists. To see (i), take $g = (f - \gamma^* h)/(1 - \gamma^*)$, which
clearly integrates to one, and is nonnegative by definition of $\gamma^*$. To
see (ii), suppose that for some $\gamma > \gamma^*$, there exists a
probability density $g$ with $f = (1 - \gamma) g + \gamma h$. Then for all $x$
such that $h(x) > 0$,

$$\frac{f(x)}{h(x)} = \gamma + (1 - \gamma)\frac{g(x)}{h(x)} \ge \gamma > \gamma^*,$$

which contradicts the definition of $\gamma^*$. □

Figure 2: Three one-dimensional examples that illustrate assumption (C). In
each example (row), $P_0$ is on the left (solid line) and $P_1$ on the right
(dotted line). In the first two examples, (C) is satisfied, but in the third
example it is not. See text for details.
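Lemma 5 is easy to try out on distributions with finitely many atoms, where the infimum over sets reduces to a minimum of the pointwise ratio over atoms with positive mass under $H$. The sketch below is an illustration under that assumption (probability vectors on a common finite set, with $F \neq H$); the function names are hypothetical.

```python
def nu_star(f, h):
    """Maximal proportion of H present in F for pmfs on a common finite set:
    the minimum of f(x) / h(x) over atoms with h(x) > 0."""
    return min(fx / hx for fx, hx in zip(f, h) if hx > 0)


def residual(f, h):
    """Return (gamma, g) with f = (1 - gamma) * g + gamma * h and gamma maximal.

    Assumes f != h, so that gamma < 1.
    """
    gamma = nu_star(f, h)
    g = [(fx - gamma * hx) / (1.0 - gamma) for fx, hx in zip(f, h)]
    return gamma, g


if __name__ == "__main__":
    F = [0.5, 0.3, 0.2]
    H = [0.2, 0.3, 0.5]
    gamma, G = residual(F, H)
    print("nu*(F, H) =", gamma)   # minimum of the ratios 2.5, 1.0, 0.4
    print("residual G =", G)      # a probability vector with a zero atom
    # The quantity is not symmetric in general:
    print(nu_star([0.5, 0.5], [0.9, 0.1]), nu_star([0.9, 0.1], [0.5, 0.5]))
```

The residual has a zero atom exactly where the ratio attains its minimum, which is why no larger proportion of $H$ can be removed from $F$.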



Lemma 5 makes it easy to check (C) for various densities. Indeed, two
densities are mutually irreducible iff the (essential) infimum and supremum of
their ratio are 0 and $\infty$, respectively. Figure 2 shows three examples
where $\mathcal{X} = \mathbb{R}$. In the first example, $P_0$ and $P_1$ are
such that the support of neither distribution is contained in the support of
the other, and therefore (C) is satisfied. In the second example, $P_0$ and
$P_1$ are Gaussian distributions with equal variances and unequal means. By
plugging in the formulas for the Gaussian densities, it is easy to verify that
(C) is again satisfied. In the third example, $P_0$ and $P_1$ are again
Gaussian densities with unequal means, but this time with unequal variances.
In this case, it is again not hard to show that $\nu^*(P_0, P_1) = 0$, but
$\nu^*(P_1, P_0) > 0$, where $P_1$ has the larger variance. Thus, (C) is not
satisfied in this case. We do note, however, that $\nu^*(P_1, P_0)$ tends to
zero very fast as the means move apart.
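The Gaussian examples can be checked numerically. For unequal variances, the log of the ratio of the wider density to the narrower one is a convex quadratic in $x$, and minimizing it gives the closed form $(\sigma_0/\sigma_1)\exp\{-(\mu_1-\mu_0)^2 / (2(\sigma_1^2-\sigma_0^2))\}$ for $\sigma_1 > \sigma_0$; this expression is derived here from (17) and is not quoted from the text. The sketch compares it with a brute-force minimum over a grid; the grid range, the parameter values, and the function names are hypothetical choices for illustration.

```python
import math


def log_density_ratio(x, mu_a, sd_a, mu_b, sd_b):
    """Log of the N(mu_a, sd_a^2) density divided by the N(mu_b, sd_b^2) density."""
    return (math.log(sd_b / sd_a)
            - (x - mu_a) ** 2 / (2 * sd_a ** 2)
            + (x - mu_b) ** 2 / (2 * sd_b ** 2))


def nu_star_closed_form(mu_wide, sd_wide, mu_narrow, sd_narrow):
    """Infimum of (wide density / narrow density); requires sd_wide > sd_narrow."""
    if sd_wide <= sd_narrow:
        raise ValueError("closed form assumes sd_wide > sd_narrow")
    gap = mu_wide - mu_narrow
    return (sd_narrow / sd_wide) * math.exp(
        -gap ** 2 / (2 * (sd_wide ** 2 - sd_narrow ** 2)))


def grid_min_ratio(mu_a, sd_a, mu_b, sd_b, lo=-20.0, hi=20.0, n=40001):
    """Minimum of the density ratio a/b over a regular grid on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return math.exp(min(log_density_ratio(lo + i * step, mu_a, sd_a, mu_b, sd_b)
                        for i in range(n)))


if __name__ == "__main__":
    # P_0 = N(0, 1), P_1 = N(1, 4): the third example type (unequal variances).
    print("nu*(P_1, P_0) closed form:", nu_star_closed_form(1.0, 2.0, 0.0, 1.0))
    print("nu*(P_1, P_0) on grid:    ", grid_min_ratio(1.0, 2.0, 0.0, 1.0))
    # The reverse ratio keeps shrinking in the tails, consistent with nu*(P_0, P_1) = 0.
    print("min of p_0/p_1 on grid:   ", grid_min_ratio(0.0, 1.0, 1.0, 2.0))
```

For equal variances the log ratio is linear in $x$, so it is unbounded below in both directions and both proportions vanish, which matches the second example.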

For the following result, let $F$ and $H$ be two continuous distributions with
densities $f$ and $h$. Lemma 5 allows us to characterize $\nu^*(F, H)$ in
terms of the ROC of the LRT.

Proposition 3. Assume that the ROC of the likelihood ratio tests
$x \mapsto 1_{\{f(x)/h(x) > \gamma\}}$ is left-differentiable at $(1, 1)$.
Then $\nu^*(F, H)$ is the slope (left-derivative) of the ROC at $(1, 1)$.

Proof. The slope of the ROC of an LRT with threshold $\gamma$ is equal to
$\gamma$ wherever the slope is well defined (Peterson et al., 1954; Scharf,
1991). The right endpoint of the ROC corresponds to
$\gamma^* = \operatorname*{ess\,inf}_{x \in \mathrm{supp}(H)} f(x)/h(x)$. That
is, for all $\gamma > \gamma^*$, the Type I error of the LRT is strictly less
than 1, whereas it equals 1 at $\gamma^*$. □



This result provides intuition for the estimator of $\nu^*(F, H)$ studied by
Blanchard et al. (2010), which can be understood as estimating the slope of
the ROC at its right endpoint. See Figure 3. This is a more direct method of
estimation compared to the "plug-in" estimate of $\nu^*(F, H)$ that proceeds
by estimating the densities $f$ and $h$, plugging these into the expression in
Lemma 5, and minimizing.

Figure 3: The receiver operating characteristic of the likelihood ratio test
$x \mapsto 1_{\{f(x)/h(x) > \gamma\}}$. The curve traces the points
$(H(\{x \mid f(x)/h(x) > \gamma\}), F(\{x \mid f(x)/h(x) > \gamma\}))$ as the
threshold $\gamma$ varies. The upper right corner corresponds to
$\gamma = \nu^*(F, H)$. The slope of the dashed line, which is tangent to the
ROC at the upper right corner, is equal to $\nu^*(F, H)$.

We conclude this section by remarking that $1 - \nu^*(F, H)$ is an example of
an information divergence, like the Kullback-Leibler divergence. In
particular, $1 - \nu^*(F, H)$ is always nonnegative, and it equals zero if and
only if $F = H$, by Proposition 2. Furthermore, Lemma 5 states that this
divergence can be expressed in terms of the likelihood ratio, like KL and
other information divergences. On the other hand, for other information
divergences, the likelihood ratio appears in an integral, whereas here we have
an infimum. This information divergence has been studied previously for
discrete distributions in the analysis of Markov chains (Aldous and Diaconis,
1987), where it is called the "separation distance." In general,
$\nu^*(F, H) \neq \nu^*(H, F)$, so that this is not actually a metric on
distributions.

In the next section, we leverage Lemma 5 to connect mutual irreducibility
to class probability estimation.



8 Mutual Irreducibility and Class Probability Estimation

In this section, we relate mutual irreducibility of $P_0$ and $P_1$ to the
problem of class probability estimation. We assume that $P_0$ and $P_1$ are
continuous distributions with densities $p_0(x)$ and $p_1(x)$. We further
assume that the feature vector $X$ and label $Y$ are jointly distributed with
joint distribution $Q$, and that $q := Q(Y = 1) \in (0, 1)$. The posterior
probability that $Y = 1$ is denoted

$$\eta(x) := Q(Y = 1 \mid X = x).$$

The problem of estimating $\eta$ from data is known as class probability
estimation (Buja et al., 2005; Reid and Williamson, 2010). The most well-known
approach to class probability estimation is logistic regression, which posits
the model

$$\eta(x) = \frac{1}{1 + \exp\{-(w^T x + b)\}},$$

where $w$ and $x$ have the same dimension, and $b \in \mathbb{R}$. The
parameters $w$ and $b$ are fit to the data by maximum likelihood. More
generally, estimates for $\eta$ commonly have the form

$$\hat\eta(x) = \psi^{-1}(h(x)),$$

where $\psi : [0, 1] \to \mathbb{R}$ is a link function, and $h$ is a decision
function of some sort. Now define

$$\eta_{\min} := \operatorname*{ess\,inf}_x \eta(x) \quad \text{and} \quad
\eta_{\max} := \operatorname*{ess\,sup}_x \eta(x).$$

The following result connects the posterior class probability to mutual
irreducibility.

Proposition 4. With the notation defined above,

$$\eta_{\max} = \frac{1}{1 + \frac{1-q}{q}\,\nu^*(P_0, P_1)} \tag{18}$$

and

$$\eta_{\min} = 1 - \frac{1}{1 + \frac{q}{1-q}\,\nu^*(P_1, P_0)}. \tag{19}$$

Therefore, $P_0$ and $P_1$ are mutually irreducible if and only if
$\eta_{\min} = 0$ and $\eta_{\max} = 1$.

Proof. By Bayes' rule, it is true that almost everywhere,

$$\eta(x) = \frac{q\,p_1(x)}{q\,p_1(x) + (1-q)\,p_0(x)}
= \frac{1}{1 + \frac{(1-q)\,p_0(x)}{q\,p_1(x)}}.$$

Equation (18) now follows from Lemma 5. Similarly, we have (almost
everywhere)

$$\eta(x) = 1 - \frac{(1-q)\,p_0(x)}{(1-q)\,p_0(x) + q\,p_1(x)}
= 1 - \frac{1}{1 + \frac{q\,p_1(x)}{(1-q)\,p_0(x)}}.$$

Now (19) follows from Lemma 5. The final statement follows from (18) and (19)
and the definition of mutual irreducibility. □

Thus, estimates of $\nu^*(P_0, P_1)$ and $\nu^*(P_1, P_0)$ could be used to
inform choices about the design of the link function and model class of
decision functions.

Proposition 4 also suggests another possible approach to mixture proportion
estimation. Suppose $\hat\eta$ is an estimator for $\eta$ that is consistent
with respect to the supremum norm, and let $\hat q$ be the empirical estimate
of $q$ based on a random sample from $Q$. Inverting Equation (18), with
$\hat\eta_{\max} := \operatorname*{ess\,sup}_x \hat\eta(x)$,

$$\frac{\hat q}{1 - \hat q}\left(\frac{1}{\hat\eta_{\max}} - 1\right)$$

is a consistent estimate of $\nu^*(P_0, P_1)$, the proportion appearing in
(18). Similar remarks apply to $\nu^*(P_1, P_0)$ via (19). Although this
suggests that class probability estimation solves mixture proportion
estimation in the binary classification context, we note that sup-norm
consistency will require distributional assumptions, and therefore the
distribution-free estimator of Blanchard et al. is a more general solution.
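The inversion of (18) and (19) amounts to simple arithmetic, sketched below. The example distributions (a uniform density and the density $2x$ on $[0, 1]$), the class prior, and the function names are hypothetical and serve only to check the algebra; in practice the extreme posterior values would have to be estimated.

```python
def eta_max_from_nu(q, nu01):
    """Equation (18): largest posterior value from nu*(P_0, P_1)."""
    return 1.0 / (1.0 + (1.0 - q) / q * nu01)


def eta_min_from_nu(q, nu10):
    """Equation (19): smallest posterior value from nu*(P_1, P_0)."""
    return 1.0 - 1.0 / (1.0 + q / (1.0 - q) * nu10)


def nu01_from_eta_max(q, eta_max):
    """Invert (18) to recover nu*(P_0, P_1)."""
    return q / (1.0 - q) * (1.0 / eta_max - 1.0)


def nu10_from_eta_min(q, eta_min):
    """Invert (19) to recover nu*(P_1, P_0); requires eta_min < 1."""
    return (1.0 - q) / q * (1.0 / (1.0 - eta_min) - 1.0)


if __name__ == "__main__":
    # P_0 uniform on [0, 1], P_1 with density 2x on [0, 1], prior q = 0.3.
    # Then p_0 / p_1 = 1 / (2x) has essential infimum 1/2 (at x = 1), and the
    # posterior eta(x) = 2qx / (2qx + 1 - q) has supremum 2q / (1 + q).
    q = 0.3
    eta_max = 2 * q / (1 + q)
    print(nu01_from_eta_max(q, eta_max))  # recovers 0.5 up to rounding
    print(nu10_from_eta_min(q, 0.0))      # posterior reaches 0, so this is 0
```

The round trip between the proportions and the extreme posterior values makes explicit why mutual irreducibility corresponds to the posterior attaining both 0 and 1 in the essential sense.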



9 Conclusion 

We have argued that consistent classification with label noise is possible if
a majority of the labels are correct on average, and the class-conditional
distributions $P_0$ and $P_1$ are mutually irreducible. Under these
conditions, we leverage results of Blanchard et al. (2010) on mixture
proportion estimation to design consistent estimators of the false positive
and negative probabilities. These estimators are applied to establish a
consistent minmax classifier, and it seems clear that other performance
measures could be analyzed similarly. Unlike previous theoretical work on this
problem, we allow that the supports of $P_0$ and $P_1$ may overlap or even be
equal, the noise is asymmetric, and that the performance measure is not the
probability of error. We also argued that requiring mutual irreducibility can
be equivalently seen as aiming at maximum denoising of the contaminated
distributions, or maximum separation of the unknown sources $P_0, P_1$ for
given contaminated distributions. Thus, our discrimination rule is universally
consistent in the sense that its performance tends to the optimal performance
corresponding to the maximally denoised versions of $\tilde P_1, \tilde P_0$,
regardless of $P_0, P_1$.

A Remaining Proofs

A.1 Proof of Proposition 1

Proof. First note that under (A), $\lambda$ is well-defined and nonnegative.
Solving for $\gamma$ we obtain

$$\gamma = \frac{\lambda(1 - \pi_0) - \pi_1}{1 - \pi_1 - \lambda\pi_0}.$$

The denominator in this expression is positive, which can be seen as follows.

$$\lambda = \frac{\pi_1 + \gamma(1 - \pi_1)}{1 - \pi_0 + \gamma\pi_0}
< \frac{1 - \pi_0 + \gamma(1 - \pi_1)}{1 - \pi_0 + \gamma\pi_0}
< \frac{\gamma(1 - \pi_1)}{\gamma\pi_0}.$$

The first inequality follows from (A), while the second follows from the fact
that the mapping $t \mapsto (a + t)/(b + t)$ is strictly decreasing in
$t > 0$ when $a > b$. Here $a = \gamma(1 - \pi_1)$ and $b = \gamma\pi_0$.
Hence $\lambda\pi_0 < 1 - \pi_1$.
Therefore,

$$\frac{p_1(x)}{p_0(x)} > \gamma
\iff \frac{p_1(x)}{p_0(x)} > \frac{\lambda(1 - \pi_0) - \pi_1}{1 - \pi_1 - \lambda\pi_0}$$

$$\iff [1 - \pi_1 - \lambda\pi_0]\,p_1(x) > [\lambda(1 - \pi_0) - \pi_1]\,p_0(x)$$

$$\iff (1 - \pi_1)p_1(x) + \pi_1 p_0(x) > \lambda[(1 - \pi_0)p_0(x) + \pi_0 p_1(x)]$$

$$\iff \frac{\tilde p_1(x)}{\tilde p_0(x)} > \lambda.$$

□

A.2 Proof of Theorem 1

Proof. By Lemmas 1 and 2, feasible quadruples $(\pi_0, \pi_1, P_0, P_1)$ for
decompositions (1)-(2) under condition (A) are in one-to-one correspondence
with feasible quadruples $(\tilde\pi_0, \tilde\pi_1, P_0, P_1)$ for
decompositions (6)-(7).

Define $\tilde\pi_0^* := \nu^*(\tilde P_0, \tilde P_1)$. Proposition 2 applied
to (6) easily implies that for any value
$\tilde\pi_0 \in [0, \tilde\pi_0^*]$, there exists a unique $P_0$ such that
$(\tilde\pi_0, P_0)$ satisfies (6); also, the solution
$(\tilde\pi_0^*, P_0^*)$ corresponding to the maximal feasible value of
$\tilde\pi_0$ is the unique one satisfying (B). A similar conclusion is valid
concerning solutions of (7).

Therefore, the feasible region $R$ for proportions $(\pi_0, \pi_1)$ in the
original model (1)-(2) is obtained as the image of the rectangle
$[0, \tilde\pi_0^*] \times [0, \tilde\pi_1^*]$ via the above one-to-one
correspondence. Using the explicit expression for
$(\tilde\pi_1, \tilde\pi_0)$ of Lemma 1, the constraints (13) simply translate
the equivalent constraints $\tilde\pi_0 \le \tilde\pi_0^*$,
$\tilde\pi_1 \le \tilde\pi_1^*$.

Since by Lemma 3, under the assumption (A), conditions (B) and (C) are
equivalent, then again via the above correspondence, we get existence and
uniqueness of $(\pi_0^*, \pi_1^*, P_0^*, P_1^*)$ for the original formulation
(1)-(2), under condition (C). The explicit expression (12) for
$(\pi_0^*, \pi_1^*)$ is obtained via Lemma 2.

The equality

$$\pi_0 + \pi_1 = 1 - \left(1 + \frac{\tilde\pi_0}{1 - \tilde\pi_0}
+ \frac{\tilde\pi_1}{1 - \tilde\pi_1}\right)^{-1}$$

implies that $\pi_0 + \pi_1$ is a monotone (strictly) increasing function of
$\tilde\pi_1$ and $\tilde\pi_0$. Therefore, the maximum of $\pi_0 + \pi_1$ can
only be reached when both $(\tilde\pi_1, \tilde\pi_0)$ take their maximum
value. Since the latter values are attained for the unique feasible quadruple
$(\tilde\pi_0^*, \tilde\pi_1^*, P_0^*, P_1^*)$ in the decoupled problem, the
corresponding maximum of $\pi_0 + \pi_1$ for the original formulation is also
uniquely attained for the quadruple $(\pi_0^*, \pi_1^*, P_0^*, P_1^*)$.

Finally, by subtracting (1) from (2), we obtain the relation

$$(P_1 - P_0) = (1 - \pi_0 - \pi_1)^{-1}(\tilde P_1 - \tilde P_0),$$

implying

$$\|P_1 - P_0\|_{TV} = (1 - \pi_0 - \pi_1)^{-1}\,\|\tilde P_1 - \tilde P_0\|_{TV}.$$

Therefore, the maximum (over $\Lambda$) of the total variation distance
$\|P_1 - P_0\|_{TV}$ is precisely attained for the maximum value of
$(\pi_0 + \pi_1)$, and hence corresponds to the unique mutually irreducible
solution. □

A.3 Proof of Theorem 2

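The following two lemmas allow us to deduce uniform convergence of
$\hat R_i$ from uniform convergence of $\widehat{\tilde R}_0$ and
$\widehat{\tilde R}_1$, and consistency of $\widehat{\tilde\pi}_0$ and
$\widehat{\tilde\pi}_1$. They will be used in the proof of Theorem 2.

Lemma 6. Let $\{\mathcal{F}_j\}_{j=1}^{\infty}$ denote a sequence of
classifier sets, with $\mathcal{F}_j$ having finite VC-dimension $V_j$. Let
$k(m,n)$ take values in $\mathbb{N}$ such that

$$\frac{V_{k(m,n)} \log(\min(m,n))}{\min(m,n)} \to 0 \tag{20}$$

as $\min(m,n) \to \infty$. Then

$$\sup_{f \in \mathcal{F}_{k(m,n)}} |\widehat{\tilde R}_0(f, Z_0^m) - \tilde R_0(f)| \to 0$$

in probability, and

$$\sup_{f \in \mathcal{F}_{k(m,n)}} |\widehat{\tilde R}_1(f, Z_1^n) - \tilde R_1(f)| \to 0$$

in probability.

Proof. Let $k = k(m,n)$. We must show that for all $\epsilon > 0$,

$$\lim_{\min(m,n) \to \infty} \tilde P_0^m\Big(\sup_{f \in \mathcal{F}_k}
|\widehat{\tilde R}_0(f, Z_0^m) - \tilde R_0(f)| > \epsilon\Big) = 0$$

and

$$\lim_{\min(m,n) \to \infty} \tilde P_1^n\Big(\sup_{f \in \mathcal{F}_k}
|\widehat{\tilde R}_1(f, Z_1^n) - \tilde R_1(f)| > \epsilon\Big) = 0.$$

Let $\ell = \min(m,n)$ and $\epsilon > 0$. By Theorem 12.5 in Devroye et al.
(1996), it suffices to show that
$8\,s(\mathcal{F}_k, \ell)\,e^{-\ell\epsilon^2/32} \to 0$ as
$\ell \to \infty$. Theorem 13.3 in Devroye et al. (1996) provides
$\ell^{V_k}$ as an upper bound on the shatter coefficient. Therefore, we have

$$8\,s(\mathcal{F}_k, \ell)\,e^{-\ell\epsilon^2/32}
\le 8\,\ell^{V_k} e^{-\ell\epsilon^2/32}
= 8\,e^{-\ell\epsilon^2/32 + V_k \log(\ell)}.$$

This final term clearly goes to zero by (20). □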



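Lemma 7. (Extension of Continuous Mapping Theorem) Let $Q_0, Q_1$ be
probability distributions. Let $\mathcal{F}_0$ denote the set of all
classifiers, and $\{\mathcal{F}_j\}_{j=1}^{\infty}$ denote a family of sets of
classifiers. Let $k(m,n)$ take values in $\mathbb{N}$ such that
$k(m,n) \to \infty$ as $\min(m,n) \to \infty$. Denote $k = k(m,n)$. Let

$$A, B : \mathcal{F}_0 \to \mathbb{R}, \qquad
\hat A, \hat B : \mathcal{F}_0 \times \mathcal{X}^m \times \mathcal{X}^n \to \mathbb{R}.$$

Assume $\sup_{f \in \mathcal{F}_k} |\hat A(f, Z_0^m, Z_1^n) - A(f)| \to 0$ and
$\sup_{f \in \mathcal{F}_k} |\hat B(f, Z_0^m, Z_1^n) - B(f)| \to 0$, in
probability, where $Z_0^m$ and $Z_1^n$ are iid random samples governed by the
product measures $Q_0^m$ and $Q_1^n$. If
$g : \Omega \subset \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is continuous
at $(A(f), B(f))$ for all $f \in \mathcal{F}_0$, then as
$\min(m,n) \to \infty$,

$$\sup_{f \in \mathcal{F}_k} |g(\hat A(f, Z_0^m, Z_1^n), \hat B(f, Z_0^m, Z_1^n))
- g(A(f), B(f))| \to 0$$

in probability.

Proof. Throughout, write $\hat A(f)$ and $\hat B(f)$ for
$\hat A(f, Z_0^m, Z_1^n)$ and $\hat B(f, Z_0^m, Z_1^n)$. For an arbitrary $f$
and samples of sizes $m$ and $n$, by the definition of continuity, for all
$\epsilon > 0$, there exists a $\delta_\epsilon > 0$ such that

$$\|(\hat A(f), \hat B(f)) - (A(f), B(f))\|_2 < 2\delta_\epsilon
\implies |g(\hat A(f), \hat B(f)) - g(A(f), B(f))| < \epsilon.$$

Since $\|\cdot\|_1 \ge \|\cdot\|_2$, it follows that

$$\|(\hat A(f), \hat B(f)) - (A(f), B(f))\|_1 < 2\delta_\epsilon
\implies |g(\hat A(f), \hat B(f)) - g(A(f), B(f))| < \epsilon.$$

From this, we can conclude that

$$\sup_{f \in \mathcal{F}_k} \|(\hat A(f), \hat B(f)) - (A(f), B(f))\|_1 < 2\delta_\epsilon
\implies \sup_{f \in \mathcal{F}_k} |g(\hat A(f), \hat B(f)) - g(A(f), B(f))| < \epsilon$$

for all $m, n$. Now,

$$0 \le Q_0^m \otimes Q_1^n\Big(\sup_{f \in \mathcal{F}_k}
\|(\hat A(f), \hat B(f)) - (A(f), B(f))\|_1 \ge 2\delta_\epsilon\Big)$$

$$= Q_0^m \otimes Q_1^n\Big(\sup_{f \in \mathcal{F}_k}
\big[|\hat A(f) - A(f)| + |\hat B(f) - B(f)|\big] \ge 2\delta_\epsilon\Big)$$

$$\le Q_0^m \otimes Q_1^n\Big(\sup_{f \in \mathcal{F}_k} |\hat A(f) - A(f)| \ge \delta_\epsilon\Big)
+ Q_0^m \otimes Q_1^n\Big(\sup_{f \in \mathcal{F}_k} |\hat B(f) - B(f)| \ge \delta_\epsilon\Big).$$

Taking the limit as $\min(m,n) \to \infty$ takes the last expression to 0,
based on our assumption of convergence in probability. Therefore, we have

$$\lim_{\min(m,n) \to \infty} Q_0^m \otimes Q_1^n\Big(\sup_{f \in \mathcal{F}_k}
\|(\hat A(f), \hat B(f)) - (A(f), B(f))\|_1 < 2\delta_\epsilon\Big) = 1. \tag{21}$$

It follows from a previous implication that

$$Q_0^m \otimes Q_1^n\Big(\sup_{f \in \mathcal{F}_k}
\|(\hat A(f), \hat B(f)) - (A(f), B(f))\|_1 < 2\delta_\epsilon\Big)
\le Q_0^m \otimes Q_1^n\Big(\sup_{f \in \mathcal{F}_k}
|g(\hat A(f), \hat B(f)) - g(A(f), B(f))| < \epsilon\Big).$$

Combining this inequality with equation (21) yields

$$\lim_{\min(m,n) \to \infty} Q_0^m \otimes Q_1^n\Big(\sup_{f \in \mathcal{F}_k}
|g(\hat A(f), \hat B(f)) - g(A(f), B(f))| < \epsilon\Big) = 1,$$

and the result follows. □

We will prove the theorem for $i = 0$, the other case being similar.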

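Proof. Let $k = k(m,n)$ for brevity. Substituting equations (10) and (14) into
the following subtraction yields

$$\hat R_0(f, Z_0^m, Z_1^n) - R_0(f)
= 1 - \widehat{\tilde R}_1(f, Z_1^n)
- \frac{1 - \widehat{\tilde R}_0(f, Z_0^m) - \widehat{\tilde R}_1(f, Z_1^n)}
{1 - \widehat{\tilde\pi}_0(Z_0^m, Z_1^n)}
- \left[1 - \tilde R_1(f) - \frac{1 - \tilde R_0(f) - \tilde R_1(f)}{1 - \tilde\pi_0}\right]$$

$$= \tilde R_1(f) - \widehat{\tilde R}_1(f, Z_1^n)
+ \frac{1 - \tilde R_0(f) - \tilde R_1(f)}{1 - \tilde\pi_0}
- \frac{1 - \widehat{\tilde R}_0(f, Z_0^m) - \widehat{\tilde R}_1(f, Z_1^n)}
{1 - \widehat{\tilde\pi}_0(Z_0^m, Z_1^n)}.$$

Take $\epsilon > 0$. By Lemma 6, we have that

$$L_1 := \lim_{\min(m,n) \to \infty} \tilde P_1^n\Big(\sup_{f \in \mathcal{F}_k}
|\tilde R_1(f) - \widehat{\tilde R}_1(f, Z_1^n)| > \frac{\epsilon}{3}\Big) = 0.$$

Now consider the following,

$$\sup_{f \in \mathcal{F}_k} \left|\frac{1 - \tilde R_0(f)}{1 - \tilde\pi_0}
- \frac{1 - \widehat{\tilde R}_0(f, Z_0^m)}{1 - \widehat{\tilde\pi}_0(Z_0^m, Z_1^n)}\right|
= \sup_{f \in \mathcal{F}_k} \big|h(\tilde R_0(f), \tilde\pi_0)
- h(\widehat{\tilde R}_0(f, Z_0^m), \widehat{\tilde\pi}_0(Z_0^m, Z_1^n))\big|,$$

where $h(x, y) = (1 - x)/(1 - y)$. This function is continuous on
$\Omega = \mathbb{R} \times (\mathbb{R} \setminus \{1\})$. By (A) and Lemma 1
we have that $\tilde\pi_0 < 1$ and therefore this function is continuous at
$(\tilde R_0(f), \tilde\pi_0)$. By (C),
$\tilde\pi_0 = \nu^*(\tilde P_0, \tilde P_1)$. Furthermore, Theorem 8 of
Blanchard et al. (2010) implies that $\widehat{\tilde\pi}_0(Z_0^m, Z_1^n)$
converges in probability to $\tilde\pi_0$, and by Lemma 6, we have that

$$\sup_{f \in \mathcal{F}_k} |\widehat{\tilde R}_0(f, Z_0^m) - \tilde R_0(f)| \to 0$$

in probability. Thus, the conditions of Lemma 7 are met with
$\hat A(f, Z_0^m, Z_1^n) = \widehat{\tilde R}_0(f, Z_0^m)$ and
$\hat B(f, Z_0^m, Z_1^n) = \widehat{\tilde\pi}_0(Z_0^m, Z_1^n)$. By applying
Lemma 7 we conclude that

$$\lim_{\min(m,n) \to \infty} \tilde P_0^m \otimes \tilde P_1^n\Big(\sup_{f \in \mathcal{F}_k}
\big|h(\tilde R_0(f), \tilde\pi_0)
- h(\widehat{\tilde R}_0(f, Z_0^m), \widehat{\tilde\pi}_0(Z_0^m, Z_1^n))\big|
> \frac{\epsilon}{3}\Big) = 0.$$

So we now define

$$L_2 := \lim_{\min(m,n) \to \infty} \tilde P_0^m \otimes \tilde P_1^n\Big(\sup_{f \in \mathcal{F}_k}
\left|\frac{1 - \tilde R_0(f)}{1 - \tilde\pi_0}
- \frac{1 - \widehat{\tilde R}_0(f, Z_0^m)}{1 - \widehat{\tilde\pi}_0(Z_0^m, Z_1^n)}\right|
> \frac{\epsilon}{3}\Big) = 0.$$

A similar argument can be made to show that

$$L_3 := \lim_{\min(m,n) \to \infty} \tilde P_0^m \otimes \tilde P_1^n\Big(\sup_{f \in \mathcal{F}_k}
\left|\frac{\tilde R_1(f)}{1 - \tilde\pi_0}
- \frac{\widehat{\tilde R}_1(f, Z_1^n)}{1 - \widehat{\tilde\pi}_0(Z_0^m, Z_1^n)}\right|
> \frac{\epsilon}{3}\Big) = 0.$$

We conclude the proof by applying the triangle inequality,

$$\lim_{\min(m,n) \to \infty} \tilde P_0^m \otimes \tilde P_1^n\Big(\sup_{f \in \mathcal{F}_k}
|\hat R_0(f, Z_0^m, Z_1^n) - R_0(f)| > \epsilon\Big) \le L_1 + L_2 + L_3 = 0.$$

□

A.4 Proof of Lemma 4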

Proof. Let $\epsilon > 0$ and let $f^* \in \mathcal{F}_0$ be a measurable
function such that $R(f^*) < R^* + \frac{\epsilon}{2}$. Also let $\wedge$ and
$\vee$ denote logical "and" and "or". Take
$P = \frac{1}{2}P_0 + \frac{1}{2}P_1$. By assumption (D), there exists a
$k_0 \in \mathbb{N}$ such that for every $k > k_0$ there exists an
$f \in \mathcal{F}_k$ such that

$$P(f(X) \neq f^*(X)) < \frac{\epsilon}{4}.$$

Combining this with the definition of $P$ yields, for such $f$,

$$P_0(f(X) \neq f^*(X)) \le 2P(f(X) \neq f^*(X)) < \frac{\epsilon}{2}.$$

Therefore, for all $k > k_0$, there exists an $f \in \mathcal{F}_k$ such that

$$\frac{\epsilon}{2} > P_0(f(X) \neq f^*(X))$$

$$= P_0\big((f(X) = 1 \wedge f^*(X) = 0) \vee (f(X) = 0 \wedge f^*(X) = 1)\big)$$

$$= P_0(f(X) = 1 \wedge f^*(X) = 0) + P_0(f(X) = 0 \wedge f^*(X) = 1)$$

$$\ge P_0(f(X) = 1 \wedge f^*(X) = 0) - P_0(f(X) = 0 \wedge f^*(X) = 1)$$

$$= P_0(f(X) = 1) - P_0(f^*(X) = 1)$$

$$= R_0(f) - R_0(f^*).$$

In the same manner, it can be shown that
$\epsilon/2 > R_1(f) - R_1(f^*)$ for the same $f \in \mathcal{F}_k$. This
establishes the existence for all $k > k_0$ of an $f \in \mathcal{F}_k$ such
that

$$R(f) = \max\{R_0(f), R_1(f)\} < \max\{R_0(f^*), R_1(f^*)\} + \frac{\epsilon}{2}
= R(f^*) + \frac{\epsilon}{2} < R^* + \epsilon.$$

Since $\epsilon$ was arbitrary the result now follows. □

A.5 Proof of Theorem 3

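Proof. Let $\epsilon > 0$, $\delta > 0$, and $k = k(m,n)$. We need to show
that for $m, n$ sufficiently large,

$$\tilde P_0^m \otimes \tilde P_1^n\big(R(\hat f_k) - R^* < \epsilon\big) \ge 1 - \delta.$$

Consider the decomposition

$$R(\hat f_k) - R^* = R(\hat f_k) - R(\mathcal{F}_k) + R(\mathcal{F}_k) - R^*.$$

Lemma 4 implies that for $m$ and $n$ sufficiently large,
$R(\mathcal{F}_k) - R^* < \epsilon/2$. We will now bound the
$R(\hat f_k) - R(\mathcal{F}_k)$ term. By the definition of
$R(\mathcal{F}_k)$, there exists $f_k^* \in \mathcal{F}_k$ such that
$R(f_k^*) < R(\mathcal{F}_k) + \epsilon/8$. It follows that

$$R(\hat f_k) - R(\mathcal{F}_k) \le R(\hat f_k) - \Big(R(f_k^*) - \frac{\epsilon}{8}\Big)
= \max\{R_0(\hat f_k), R_1(\hat f_k)\} - \max\{R_0(f_k^*), R_1(f_k^*)\} + \frac{\epsilon}{8}. \tag{22}$$

It follows by Theorem 2 that for $m, n$ sufficiently large, we have

$$\tilde P_0^m \otimes \tilde P_1^n\Big(\sup_{f \in \mathcal{F}_k}
|\hat R_0(f) - R_0(f)| > \frac{\epsilon}{8}\Big) \le \delta/2,$$

$$\tilde P_0^m \otimes \tilde P_1^n\Big(\sup_{f \in \mathcal{F}_k}
|\hat R_1(f) - R_1(f)| > \frac{\epsilon}{8}\Big) \le \delta/2.$$

Assume that both

$$|\hat R_0(f) - R_0(f)| \le \frac{\epsilon}{8} \quad \text{for all } f \in \mathcal{F}_k,$$

$$|\hat R_1(f) - R_1(f)| \le \frac{\epsilon}{8} \quad \text{for all } f \in \mathcal{F}_k,$$

which by the result just stated, occurs with probability at least
$1 - \delta$ for $m$ and $n$ sufficiently large. It follows that

$$\max\{R_0(\hat f_k), R_1(\hat f_k)\} \le \max\{\hat R_0(\hat f_k), \hat R_1(\hat f_k)\} + \frac{\epsilon}{8}$$

and

$$\max\{R_0(f_k^*), R_1(f_k^*)\} \ge \max\{\hat R_0(f_k^*), \hat R_1(f_k^*)\} - \frac{\epsilon}{8}.$$

Using these inequalities in Equation (22) yields

$$R(\hat f_k) - R(\mathcal{F}_k) \le \max\{\hat R_0(\hat f_k), \hat R_1(\hat f_k)\} + \frac{\epsilon}{8}
- \Big(\max\{\hat R_0(f_k^*), \hat R_1(f_k^*)\} - \frac{\epsilon}{8}\Big) + \frac{\epsilon}{8}.$$

From our definition of $\hat f_k$ in Equation (16), for $m$ and $n$
sufficiently large we have

$$\max\{\hat R_0(\hat f_k), \hat R_1(\hat f_k)\} \le \max\{\hat R_0(f_k^*), \hat R_1(f_k^*)\} + \frac{\epsilon}{8}.$$

Therefore, we can conclude that

$$R(\hat f_k) - R(\mathcal{F}_k) \le \frac{\epsilon}{2},$$

with probability at least $1 - \delta$. Thus, we conclude that

$$\tilde P_0^m \otimes \tilde P_1^n\big(R(\hat f_k) - R^* < \epsilon\big) \ge 1 - \delta,$$

for $m$ and $n$ sufficiently large. □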